[SPARK-25676][SQL][TEST] Rename and refactor BenchmarkWideTable to use main method #22823
Conversation
`bin/spark-submit --class <this class> <spark sql test jar>` -> `bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>`
It seems that this is not a good way, cc @dongjoon-hyun
How about adding `val splitThreshold = SQLConf.get.getConfString("spark.testing.codegen.splitThreshold", "1024").toInt` to our "To run this benchmark" section?
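For illustration, a minimal, self-contained sketch of the suggested pattern: read a string config with a default and convert it to an `Int`. `ConfSketch` here is a hypothetical stand-in for Spark's `SQLConf` (which lives in `org.apache.spark.sql.internal` and is not reproduced here); only the shape of `getConfString(key, default)` is assumed from the suggestion above.

```scala
// Hypothetical stand-in for SQLConf, backed by a plain mutable Map.
object ConfSketch {
  private val settings = scala.collection.mutable.Map[String, String]()

  // Mirrors the shape of SQLConf.getConfString(key, default):
  // return the stored value, or the default when the key is unset.
  def getConfString(key: String, default: String): String =
    settings.getOrElse(key, default)

  def set(key: String, value: String): Unit = settings(key) = value
}

object ConfSketchDemo {
  def main(args: Array[String]): Unit = {
    // Falls back to "1024" when the key is unset, as in the suggestion.
    val splitThreshold =
      ConfSketch.getConfString("spark.testing.codegen.splitThreshold", "1024").toInt
    println(splitThreshold)
  }
}
```

The point of the string-typed default is that the benchmark instructions can show a one-line snippet users paste in, without adding a typed config entry.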
@wangyum Thanks for the suggestion! You prefer modifying CodeGenerator.scala each time we run this benchmark, right? I feel it could be tricky to ask users to modify the code, and if CodeGenerator.scala changes in the future, it will be hard to keep the documentation here up to date. @dongjoon-hyun @gengliangwang any suggestions?
In this case, we need advice from the right persons. :)
Personally I don't think this is a good solution. We should start a new discussion about whether to make it configurable in production as well.
`"10", "100", "1024", "8196", "65536"` -> `"10", "100", "1024", "2048", "4096", "8196", "65536"`?
`BenchmarkWideTable` -> `WideTableBenchmark`?
Test build #98013 has finished for PR 22823 at commit
Thanks @wangyum for the good suggestion!
Test build #98018 has finished for PR 22823 at commit
Test build #98022 has finished for PR 22823 at commit
Hi, @davies and @cloud-fan and @kiszk .
This benchmark was added in Spark 2.1.0. The value 1k was determined by manually changing the split threshold.
This PR wants to add a configuration in CodeGenerator.scala for testing purposes only.
- Is the configuration helpful for general purposes?
- If so, can we make another PR for that first?
- If not, is it acceptable to add this testing-only parameter?
I think we should have a PR to add this config officially. It should be useful for performance tuning.
Thank you for the decision, @cloud-fan !
@yucai . Please proceed to propose a new PR for only that new configuration (if you haven't started yet).
@dongjoon-hyun I am working on #22847.
Test build #98208 has finished for PR 22823 at commit
…nction configurable

## What changes were proposed in this pull request?

As per the discussion in [apache#22823](https://github.com/apache/spark/pull/22823/files#r228400706), add a new configuration to make the split threshold for the code generated function configurable. When the generated Java function source code exceeds `spark.sql.codegen.methodSplitThreshold`, it will be split into multiple small functions.

## How was this patch tested?

manual tests

Closes apache#22847 from yucai/splitThreshold.

Authored-by: yucai <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun Just pushed the rebased version, thanks!
Thank you, @yucai . Could you update the title because we are renaming it? Maybe,
Test build #98501 has finished for PR 22823 at commit
Update result
Test build #98509 has finished for PR 22823 at commit
split threshold 10                           38932 / 39307          0.0       37128.1       1.0X
split threshold 100                          31991 / 32556          0.0       30508.8       1.2X
split threshold 1024                         10993 / 11041          0.1       10483.5       3.5X
split threshold 2048                          8959 / 8998           0.1        8543.8       4.3X
@dongjoon-hyun On my Mac, in most cases, 2048 is the best.
I tested it with OpenJDK (in a Linux VM); 2048 is also the best.
================================================================================================
projection on wide table
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.11.1.el7.x86_64
Intel Core Processor (Haswell)
projection on wide table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
split threshold 10 23995 / 25673 0.0 22883.7 1.0X
split threshold 100 12881 / 13419 0.1 12284.3 1.9X
split threshold 1024 6435 / 7402 0.2 6137.2 3.7X
split threshold 2048 5861 / 6766 0.2 5589.2 4.1X
split threshold 4096 6464 / 7825 0.2 6164.6 3.7X
split threshold 8192 7886 / 8742 0.1 7520.7 3.0X
split threshold 65536 46143 / 48029 0.0 44005.6 0.5X
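For readers unfamiliar with this output format: the `Relative` column is simply the baseline's best time divided by each case's best time. A small sketch, using the best times from the OpenJDK run above (the object name `RelativeSpeedup` is just for illustration):

```scala
// Sketch: derive the "Relative" column from best times in milliseconds.
// The baseline is the first case (split threshold 10), best time 23995 ms.
object RelativeSpeedup {
  def relative(baselineMs: Double, caseMs: Double): Double = baselineMs / caseMs

  def main(args: Array[String]): Unit = {
    val baseline = 23995.0
    // split threshold 2048 has best time 5861 ms -> roughly 4.1X, matching the table
    println(f"${relative(baseline, 5861.0)}%.1fX")
  }
}
```

This is why the reviewers compare rows by the `Relative` column rather than raw times: it stays comparable even though absolute times differ across machines.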
@yucai . I trust you. :) Don't worry about that. The scope of this PR is not for choosing the best option.
Test build #98519 has finished for PR 22823 at commit
retest this please
dongjoon-hyun
left a comment
@yucai , @cloud-fan , @gatorsmile , @gengliangwang .
This PR looks good to me. This refactoring is not for choosing the best default value for the configuration; that is beyond its scope.
The results on different machines (JVM/CPU/disk) are not the same. This PR clearly shows the benefit of @yucai 's earlier configuration PR, #22847. Users can run benchmarks (including this one) with various configurations on their own machines and workloads, and choose the best values with that configuration.
Test build #98522 has finished for PR 22823 at commit
Merged to
Thank you all.
Seq("10", "100", "1024", "2048", "4096", "8192", "65536").foreach { n =>
  benchmark.addCase(s"split threshold $n", numIters = 5) { iter =>
    withSQLConf(SQLConf.CODEGEN_METHOD_SPLIT_THRESHOLD.key -> n) {
      df.selectExpr(columns: _*).foreach(identity(_))
    }
  }
}
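The `withSQLConf` helper above scopes each threshold to a single benchmark case. A minimal sketch of that set-then-restore pattern, under the assumption of a map-backed config (`WithConfSketch` and `withConf` are hypothetical stand-ins, not Spark's actual implementation):

```scala
// Hypothetical sketch of a withSQLConf-style helper: set a config key for the
// duration of `body`, then restore the previous value (or remove the key), so
// benchmark cases do not leak settings into each other.
object WithConfSketch {
  private val conf = scala.collection.mutable.Map[String, String]()

  def get(key: String): Option[String] = conf.get(key)

  def withConf[T](key: String, value: String)(body: => T): T = {
    val previous = conf.get(key)
    conf(key) = value
    try body
    finally previous match {
      case Some(v) => conf(key) = v   // restore the old value
      case None    => conf.remove(key) // key was unset before; unset it again
    }
  }
}
```

The restore happens in a `finally` block, so even a failing benchmark case leaves the configuration as it found it.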
Hi, All.
It turns out that this breaks the Scala 2.12 build. I made a PR to fix that: #22970
I see, thanks!
…e main method

## What changes were proposed in this pull request?

Refactor BenchmarkWideTable to use main method. Generate benchmark result:

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.WideTableBenchmark"
```

## How was this patch tested?

manual tests

Closes apache#22823 from yucai/BenchmarkWideTable.

Lead-authored-by: yucai <[email protected]>
Co-authored-by: Yucai Yu <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?

Refactor BenchmarkWideTable to use main method. Generate benchmark result:

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.WideTableBenchmark"

How was this patch tested?

manual tests